AdvGLUE

The Adversarial GLUE Benchmark

Performance of FreeLB (single model) on AdvGLUE

Overall Statistics

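[Figure: bar charts of FreeLB results, one panel per task, axis 0 to 100. Series: GLUE Dev, AdvGLUE Word, AdvGLUE Sentence, AdvGLUE Human, AdvGLUE Overall. Tasks: SST-2, QQP, QNLI, RTE, MNLI-m, MNLI-mm.]
Values, in source order, grouped by panel:
Accuracy: 96.4, 66.5, 49.0, 82.3, 61.7
F1 and Accuracy: 92.6, 92.6, 33.1, 27.2, 40.5, 87.7, 78.9, 42.2, 31.1
Accuracy: 95.0, 70.6, 57.7, 47.4, 62.3
Accuracy: 86.7, 66.7, 56.6, 62.2
Accuracy: 90.6, 38.5, 21.1, 31.6
Accuracy: 90.6, 35.7, 18.1, 26.4, 27.6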

Performance of FreeLB (single model) on each task

The Stanford Sentiment Treebank (SST-2)

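[Figure: bar chart of adversarial accuracy by attack type, axis 0 to 100. Legend: Word, Sentence, Human.]
Values, in source order: 53.1, 79.1, 67.6, 69.1, 61.4 (Typo, Knowledge, Embedding, Context, Composition); 39.4, 63.5 (Syntactic, Distraction); 82.3 (CheckList).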

Quora Question Pairs (QQP)

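[Figure: bar charts of adversarial accuracy and adversarial F1 by attack type, axis 0 to 100. Legend: Word, Sentence, Human.]
Values, in source order: 36.0, 5.9, 19.0, 28.6, 39.7 (Typo, Knowledge, Embedding, Context, Composition); 27.3, 11.1, 15.0, 13.8, 35.5, 40.5 (Syntactic); 87.7 (CheckList); 78.9.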

MultiNLI (MNLI) matched

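[Figure: bar chart of adversarial accuracy by attack type, axis 0 to 100. Legend: Word, Sentence.]
Values, in source order: 37.0, 37.5, 26.2, 44.7, 38.6 (Typo, Knowledge, Embedding, Context, Composition); 18.4, 25.9 (Syntactic, Distraction).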

MultiNLI (MNLI) mismatched

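[Figure: bar chart of adversarial accuracy by attack type, axis 0 to 100. Legend: Word, Sentence, Human.]
Values, in source order: 45.9, 34.5, 29.6, 27.6, 35.4 (Typo, Knowledge, Embedding, Context, Composition); 14.2, 24.8 (Syntactic, Distraction); 29.0, 23.7 (StressTest, ANLI).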

Question NLI (QNLI)

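[Figure: bar chart of adversarial accuracy by attack type, axis 0 to 100. Legend: Word, Sentence, Human.]
Values, in source order: 69.4, 73.2, 65.8, 66.3, 75.5 (Typo, Knowledge, Embedding, Context, Composition); 47.5, 71.7 (Syntactic, Distraction); 64.9, 35.0 (CheckList, AdvSQuAD).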

Recognizing Textual Entailment (RTE)

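[Figure: bar chart of adversarial accuracy by attack type, axis 0 to 100. Legend: Word, Sentence.]
Values, in source order: 65.2, 67.7, 81.4, 63.0, 54.5 (Typo, Knowledge, Embedding, Context, Composition); 47.7, 72.9 (Syntactic, Distraction).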